First let’s import the FY16-FY19 property assessment data, provided by the City of Cambridge.
rawPropertyData <- read_delim('Cambridge_Property_Database_FY16-FY19.csv', delim = ',', guess_max = 10000)
## Parsed with column specification:
## cols(
## .default = col_character(),
## PID = col_double(),
## BldgNum = col_double(),
## StateClassCode = col_double(),
## LandArea = col_double(),
## YearOfAssessment = col_double(),
## BuildingValue = col_double(),
## LandValue = col_double(),
## AssessedValue = col_double(),
## SalePrice = col_double(),
## PreviousAssessedValue = col_double(),
## Exterior_NumStories = col_double(),
## Exterior_WallHeight = col_double(),
## Exterior_FloorLocation = col_double(),
## Interior_LivingArea = col_double(),
## Interior_NumUnits = col_double(),
## Interior_TotalRooms = col_double(),
## Interior_Bedrooms = col_double(),
## Interior_FullBaths = col_double(),
## Interior_HalfBaths = col_double(),
## Interior_Fireplaces = col_double()
## # ... with 7 more columns
## )
## See spec(...) for full column specifications.
# importProblems <- problems(rawPropertyData <- read_delim('Cambridge_Property_Database_FY16-FY19.csv', delim = ','))
It’s having some difficulty importing the Property Tax Amount as the first few thousand records are all integers, and so it infers an integer type. I’ve tried playing around with the guess_max option of read_delim to no avail. For now it’s not really a big deal.
Let’s take a look at what the data looks like.
summary(rawPropertyData)
## PID GISID BldgNum Address
## Min. : 3 Length:115838 Min. : 1.000 Length:115838
## 1st Qu.: 7118 Class :character 1st Qu.: 1.000 Class :character
## Median : 14206 Mode :character Median : 1.000 Mode :character
## Mean : 45961 Mean : 1.157
## 3rd Qu.: 21316 3rd Qu.: 1.000
## Max. :194531 Max. :38.000
##
## Unit StateClassCode PropertyClass Zoning
## Length:115838 Min. : 13.0 Length:115838 Length:115838
## Class :character 1st Qu.: 102.0 Class :character Class :character
## Mode :character Median : 102.0 Mode :character Mode :character
## Mean : 310.1
## 3rd Qu.: 199.0
## Max. :9971.0
##
## Map/Lot LandArea YearOfAssessment TaxDistrict
## Length:115838 Min. : 0 Min. : 0 Length:115838
## Class :character 1st Qu.: 0 1st Qu.: 2017 Class :character
## Mode :character Median : 0 Median : 2018 Mode :character
## Mean : 11703 Mean : 5990
## 3rd Qu.: 4300 3rd Qu.: 2019
## Max. :7251433 Max. :2175997
## NA's :1056
## ResidentialExemption BuildingValue LandValue
## Length:115838 Min. : -1419000 Min. : -7600
## Class :character 1st Qu.: 329800 1st Qu.: 0
## Mode :character Median : 526200 Median : 0
## Mean : 3196949 Mean : 3704024
## 3rd Qu.: 755300 3rd Qu.: 560000
## Max. :982713000 Max. :929332000
## NA's :1056
## AssessedValue SalePrice Book/Page
## Min. :0.000e+00 Min. :0.000e+00 Length:115838
## 1st Qu.:4.484e+05 1st Qu.:1.000e+00 Class :character
## Median :6.868e+05 Median :2.150e+05 Mode :character
## Mean :6.080e+06 Mean :3.732e+06
## 3rd Qu.:1.159e+06 3rd Qu.:5.200e+05
## Max. :1.380e+09 Max. :1.229e+09
##
## SaleDate PreviousAssessedValue Owner_Name
## Length:115838 Min. :0.000e+00 Length:115838
## Class :character 1st Qu.:3.956e+05 Class :character
## Mode :character Median :6.156e+05 Mode :character
## Mean :4.698e+06
## 3rd Qu.:1.031e+06
## Max. :1.295e+09
## NA's :528
## Owner_CoOwnerName Owner_Location Owner_Address
## Length:115838 Length:115838 Length:115838
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Owner_Address2 Owner_City Owner_State
## Length:115838 Length:115838 Length:115838
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Owner_Zip Exterior_Style Exterior_occupancy
## Length:115838 Length:115838 Length:115838
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Exterior_NumStories Exterior_WallType Exterior_WallHeight
## Min. : 0.000 Length:115838 Min. : 1.00
## 1st Qu.: 1.000 Class :character 1st Qu.:10.00
## Median : 1.000 Mode :character Median :12.00
## Mean : 1.615 Mean :11.24
## 3rd Qu.: 2.250 3rd Qu.:12.00
## Max. :26.000 Max. :80.00
## NA's :1979 NA's :104620
## Exterior_RoofType Exterior_RoofMaterial Exterior_FloorLocation
## Length:115838 Length:115838 Min. : 1.00
## Class :character Class :character 1st Qu.: 1.00
## Mode :character Mode :character Median : 2.00
## Mean : 2.79
## 3rd Qu.: 3.00
## Max. :25.00
## NA's :58906
## Exterior_View Interior_LivingArea Interior_NumUnits
## Length:115838 Min. : 0 Min. : 0
## Class :character 1st Qu.: 735 1st Qu.: 0
## Mode :character Median : 1170 Median : 1
## Mean : 3626 Mean : 850
## 3rd Qu.: 2239 3rd Qu.: 2
## Max. :956664 Max. :374995
## NA's :1979 NA's :59929
## Interior_TotalRooms Interior_Bedrooms Interior_Kitchens
## Min. : 0 Min. : 0.000 Length:115838
## 1st Qu.: 3 1st Qu.: 1.000 Class :character
## Median : 5 Median : 2.000 Mode :character
## Mean : 435 Mean : 2.552
## 3rd Qu.: 7 3rd Qu.: 3.000
## Max. :374995 Max. :495.000
## NA's :10359 NA's :11252
## Interior_FullBaths Interior_HalfBaths Interior_Fireplaces
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.000 Median : 0.000 Median : 0.000
## Mean : 1.549 Mean : 0.242 Mean : 0.267
## 3rd Qu.: 2.000 3rd Qu.: 0.000 3rd Qu.: 0.000
## Max. :321.000 Max. :40.000 Max. :24.000
## NA's :12191 NA's :12151 NA's :12151
## Interior_Flooring Interior_Layout Interior_LaundryInUnit
## Length:115838 Length:115838 Length:115838
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Systems_HeatType Systems_HeatFuel Systems_CentralAir
## Length:115838 Length:115838 Length:115838
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Systems_Plumbing Condition_YearBuilt Condition_InteriorCondition
## Length:115838 Min. : 0 Length:115838
## Class :character 1st Qu.:1884 Class :character
## Mode :character Median :1908 Mode :character
## Mean :1697
## 3rd Qu.:1950
## Max. :2018
## NA's :1979
## Condition_OverallCondition Condition_OverallGrade Parking_Open
## Length:115838 Length:115838 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.000
## Mean : 0.619
## 3rd Qu.: 1.000
## Max. :80.000
## NA's :12298
## Parking_Covered Parking_Garage UnfinishedBasementGross
## Min. : 0.000 Min. : 0.00 Min. : -904.0
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 0.000 Median : 0.00 Median : 0.0
## Mean : 0.133 Mean : 0.35 Mean : 522.7
## 3rd Qu.: 0.000 3rd Qu.: 1.00 3rd Qu.: 883.0
## Max. :45.000 Max. :99.00 Max. :77288.0
## NA's :12151 NA's :56661 NA's :1939
## FinishedBasementGross PropertyTaxAmount
## Min. : -950.0 Min. : -1
## 1st Qu.: 0.0 1st Qu.: 2986
## Median : 0.0 Median : 4573
## Mean : 122.8 Mean : 19090
## 3rd Qu.: 0.0 3rd Qu.: 7725
## Max. :50516.0 Max. :6406930
## NA's :11218 NA's :86703
Obviously there are a lot of variables here. I’ll explore many of these in more detail later, but some obvious things jump out at a glance. First, there are a lot of string values which should probably be coded as factors. This will make analysis a little bit easier. Second, there are definitely some garbage values in here. Property Tax Amount jumps out at the end here with more than half of properties showing a value of -1. And there are plenty of other weird values as well, such as basement sizes in the negatives, a Year Built of zero, one building with 374,995 units? That just seems implausible.
What do the values for YearOfAssessment look like?
ggplot(rawPropertyData, aes(x = YearOfAssessment)) + geom_histogram(bins = 50)
That doesn’t look right. What do values look like outside of a sane range?
rawPropertyData %>% filter(YearOfAssessment > 2019 | YearOfAssessment <= 1900) %>% ggplot(aes(x = YearOfAssessment)) + geom_histogram(bins = 50)
And what about in a sane range?
rawPropertyData %>% filter(YearOfAssessment > 1900 & YearOfAssessment <= 2019) %>% ggplot(aes(x = YearOfAssessment)) + geom_histogram(binwidth = 1)
Okay, so clearly there are a good amount of records with garbage values. How many exactly?
rawPropertyData %>% transmute(SaneYOA = YearOfAssessment > 1990 & YearOfAssessment <= 2019) %>% count(SaneYOA)
In the scheme of things, not so many. We can decide later if it makes sense to drop these rows or just replace with some sort of NA value.
Now, let’s take a look at the date of sale.
rawPropertyData %>% select(SaleDate) %>% head(20)
Okay, there is obviously something strange happening here. Several have the date 01/05/0001. To me that looks like a code for unknown date. Others are dated 01/01/xxxxx where xxxxx is some very large number, usually in the 30,000 to 40,000 range. I’m guessing that’s the number of days since 1900.
rawPropertyData %>% select(SaleDate) %>% mutate(chars = str_length(SaleDate)) %>% filter(chars == 10) %>% count(SaleDate) %>% arrange(desc(n))
It appears another code for unknown date is 01/01/1900. I’m not quite sure what’s going on with 01/01/1998, that seems a little suspicious. Are there any sales before 1900?
rawPropertyData %>% mutate(SaleYear = as.integer(stri_extract_last_regex(SaleDate, "[0-9]{4,5}"))) %>% filter(SaleYear > 1 & SaleYear < 1900) %>% select(Owner_Name, SaleDate, Address)
Just some City of Cambridge properties, and the dates look legitimate. So, we can take any large years and convert them to dates and perhaps set dates with years less than 1800 to NA.
Let’s start small and convert those string values into factors.
factorCols <- c(
"PID",
"StateClassCode",
"PropertyClass",
"Zoning",
"TaxDistrict",
"ResidentialExemption",
"Owner_State",
"Exterior_Style",
"Exterior_occupancy",
"Exterior_WallType",
"Exterior_RoofType",
"Exterior_RoofMaterial",
"Interior_Flooring",
"Interior_Layout",
"Interior_LaundryInUnit",
"Systems_HeatType",
"Systems_HeatFuel",
"Systems_CentralAir",
"Systems_Plumbing",
"Condition_InteriorCondition",
"Condition_OverallCondition",
"Condition_OverallGrade"
)
propertyData <- rawPropertyData %>% mutate_at(factorCols, as.factor)
Now let’s get a shorter representation of the address and GPS coordinates from the Address field.
# propertyData %>% select(Address) %>% mutate(loc = regmatches(Address, regexpr("\\(.*\\)", Address)))
propertyData <- propertyData %>% mutate(
GPSLoc = stri_extract_last_regex(Address, "\\(.*\\)"),
ShortAddress = str_remove(Address, "\\n.*\\n.*"))
Let’s correct those bogus-looking dates.
# create a date object at Jan 1 1900 as reference for corrected dates
baseDate <- as_date("01/01/1900", tz = "UTC", format = "%m/%d/%Y")
propertyData <- propertyData %>%
mutate(SaleYear = as.integer(stri_extract_last_regex(SaleDate, "[0-9]{4,5}"))) %>%
mutate(SaleDate = na_if(SaleDate, "01/05/0001")) %>%
mutate(SaleDate = na_if(SaleDate, "01/01/1900")) %>%
mutate(SaleDate = if_else(
SaleYear > 2100,
baseDate + as.period(SaleYear, unit = "days"),
as_date(SaleDate, tz = "UTC", format = "%m/%d/%Y")))
propertyData %>% count(SaleDate) %>% arrange(desc(n))
Which property ids appear the most?
propertyData %>% count(PID) %>% arrange(desc(n)) %>% head(n = 15) %>%
ggplot(aes(x = reorder(PID, n), y = n)) + geom_bar(stat = "identity") +
xlab("Property ID") + ylab("Number of Assessments")
What do the assessed properties look like from the three most common property ids?
propertyData %>% filter(PID == 11716)
propertyData %>% filter(PID == 21897)
propertyData %>% filter(PID == 12630)
Found Harvard and MIT!
What is the overall distribution of AssessedValue?
ggplot(propertyData, aes(x = AssessedValue)) + geom_histogram(bins = 50) + scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 10719 rows containing non-finite values (stat_bin).
What are the most valuable properties (by assessment) in Cambridge?
propertyData %>% arrange(desc(AssessedValue)) %>% select(PID, Owner_Name, ShortAddress, AssessedValue, Exterior_Style) %>% head(20)
Okay, that’s a boring answer. Of course the university properties are worth a billion dollars. What about excepting universities and colleges?
propertyData %>% arrange(desc(AssessedValue)) %>% filter(PropertyClass != "Private College, University" & PropertyClass != "Private College") %>% select(PID, Owner_Name, ShortAddress, AssessedValue, PropertyClass, Exterior_Style) %>% head(100)# %>% kable()
545 Technology Squre on top with $453M and $395M. Novartis also sitting on some valuable real estate. Some interesting properties lower in the data, such as the vacant utility authorities at #23 and #29, worth $305M and $274M. Residential property starts pretty low down the list at #51 and $226M, one of the Twenty20 buildings.
What are the most common types of properties?
propertyData %>% count(PropertyClass) %>% arrange(desc(n))
propertyData %>% count(PropertyClass) %>% arrange(desc(n)) %>% head(n = 10) %>% ggplot(aes(x = reorder(PropertyClass, n), y = n)) + geom_bar(stat = "identity") + xlab("Property Class") + ylab("Number of Assessments") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Okay, let’s just take a look at single family residential properties.
# let's look at just single unit residential properties
singleResUnitData <- propertyData %>%
filter(
PropertyClass == "CONDOMINIUM" | PropertyClass == "SNGL-FAM-RES" | PropertyClass == "CNDO LUX"
)
singleResUnitData %>% count(Interior_NumUnits)
What are the most valuable single family properties in the city?
singleResUnitData %>% arrange(desc(AssessedValue)) %>% filter(YearOfAssessment == 2018) %>% select(PID, Owner_Name, ShortAddress, AssessedValue, PropertyClass, Exterior_Style) %>% head(100)
The addresses sound about right.
How many assessments were performed each year?
singleResUnitData %>% count(YearOfAssessment)
How many assessments are available for each property?
singleResUnitData %>% count(PID) %>% ggplot(aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The vast majority have four assessments. Let’s create a tbl of all of the properties with four assessments.
singleResFourAssessmentPIDs <- singleResUnitData %>%
count(PID) %>% filter(n == 4) %>% select(PID)
singleResFourAssessments <- inner_join(singleResUnitData, singleResFourAssessmentPIDs)
## Joining, by = "PID"
singleResYoYAssessedValue <- singleResFourAssessments %>%
select(PID, AssessedValue, YearOfAssessment) %>%
spread(YearOfAssessment, AssessedValue)
# singleResFourAssessments %>% ggplot(aes(x = YearOfAssessment, y = AssessedValue, group = PID)) + geom_line() + scale_y_log10()
singleResFourAssessments %>% ggplot(aes(x = YearOfAssessment, group = YearOfAssessment, y = AssessedValue)) + geom_boxplot() + scale_y_log10()
singleResFourAssessments %>% group_by(YearOfAssessment) %>% summarize(mean(AssessedValue))
singleResFourAssessments %>% select(PID, SalePrice, YearOfAssessment) %>% spread(YearOfAssessment, SalePrice)
singleResFourAssessments %>% group_by(PID) %>% summarize(std = sd(SalePrice) + 1e-5) %>% ggplot(aes(x = std)) + geom_histogram(bins = 50) + scale_x_log10()
# looking at only maximum assessed value for each property, how much does
# the number of bedrooms affect assessed value?
singleResUnitData %>% group_by(PID) %>% summarize_if(is.numeric, max) %>%
filter(AssessedValue > 0) %>% filter(Interior_Bedrooms <= 3) %>%
ggplot(aes(x = AssessedValue)) + geom_histogram(bins = 50) + scale_x_log10() +
facet_grid(Interior_Bedrooms ~ ., scales = "free_y")
# looking at only maximum assessed value for each property, how much does
# the number of bedrooms affect assessed value?
singleResUnitData %>% group_by(PID) %>% summarize_if(is.numeric, max) %>%
filter(AssessedValue > 0) %>% filter(Interior_Bedrooms <= 3) %>%
ggplot(aes(x = AssessedValue)) + geom_histogram(bins = 50) +
scale_x_log10()
# what do single unit residential properties with a rational sale price
# sell for compared to assessed value?
singleResUnitData %>% filter(SalePrice > 100) %>%
ggplot(aes(x = AssessedValue, y = SalePrice, col = Interior_Bedrooms)) +
geom_point() + scale_x_log10() + scale_y_log10()
singleResUnitData %>% group_by(PID) %>% summarize_if(is.numeric, max) %>%
filter(AssessedValue > 0) %>% mutate(Has_Fireplace = Interior_Fireplaces > 0) %>%
ggplot(aes(x = AssessedValue)) + geom_histogram(bins = 50) + scale_x_log10() +
facet_grid(Has_Fireplace ~ ., scales = "free_y")
singleResUnitData %>% filter(YearOfAssessment == 2018) %>%
mutate(YearsSinceSold = 2018 - year(SaleDate)) %>% filter(YearsSinceSold > 0) %>%
ggplot(aes(x = YearsSinceSold)) + geom_histogram(binwidth = 1)
Let’s initially look at properties listed as sold in 2014. I’ll use the 2016 assessments as they appear to be the most complete. I’ll also filter out extremely low valued sales although it may be interesting to see if we can classify when that type of sale will happen.
singleResUnitData %>% filter(YearOfAssessment == 2016) %>% filter(year(SaleDate) == 2014)
singleResUnitData %>% filter(YearOfAssessment == 2016) %>% filter(year(SaleDate) == 2014) %>% select_if(is.numeric) %>% summarise_all(sd) %>% t()
## [,1]
## BldgNum 0.000000e+00
## LandArea 2.268142e+03
## YearOfAssessment 0.000000e+00
## BuildingValue 3.534488e+05
## LandValue 2.931463e+05
## AssessedValue 5.008223e+05
## SalePrice 5.623093e+05
## PreviousAssessedValue 4.572806e+05
## Exterior_NumStories 7.783219e-01
## Exterior_WallHeight NA
## Exterior_FloorLocation NA
## Interior_LivingArea 6.973143e+02
## Interior_NumUnits NA
## Interior_TotalRooms 1.923948e+00
## Interior_Bedrooms 1.024363e+00
## Interior_FullBaths 7.292727e-01
## Interior_HalfBaths 4.814183e-01
## Interior_Fireplaces 6.723882e-01
## Condition_YearBuilt 1.257303e+02
## Parking_Open 7.701008e-01
## Parking_Covered 3.325217e-01
## Parking_Garage NA
## UnfinishedBasementGross 3.200294e+02
## FinishedBasementGross 2.376105e+02
## PropertyTaxAmount NA
## SaleYear 9.324429e+01
Some of these standard deviations are coming up NA or zero so we’ll have to cull them (for now at least) to get rational results.
singleResSaleTrain <- singleResUnitData %>%
filter(YearOfAssessment == 2016) %>% filter(year(SaleDate) == 2014) %>%
filter(SalePrice > 1000) %>%
mutate(Interior_TotalRooms = recode(Interior_TotalRooms, `0` = 1),
InUnitWD = Interior_LaundryInUnit == 1) %>%
select(-c(BldgNum, YearOfAssessment, Exterior_WallHeight, Exterior_FloorLocation,
Interior_NumUnits, PropertyTaxAmount, Parking_Garage, SaleYear))
singleResSaleTrain %>% select_if(is.numeric) %>% cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE)
## SalePrice AssessedValue PreviousAssessedValue
## 1.00000000 0.95730203 0.84151737
## Interior_LivingArea BuildingValue LandValue
## 0.83037669 0.80751596 0.69812040
## LandArea Interior_TotalRooms Interior_FullBaths
## 0.64863353 0.64222533 0.61179475
## Interior_Fireplaces Interior_Bedrooms UnfinishedBasementGross
## 0.60409800 0.58374500 0.54012092
## Interior_HalfBaths Parking_Open Exterior_NumStories
## 0.45837096 0.43104066 0.40662074
## Parking_Covered FinishedBasementGross Condition_YearBuilt
## 0.38509781 0.35216095 -0.03327682
As expected, AssessedValue and PreviousAssessedValue have very high correlation to the sale price. Interior_LivingArea is very highly correlated as well which shouldn’t be too surprising.
for (factorCol in factorCols) {
if (factorCol != "PID") {
print(factorCol)
singleResSaleTrain %>% select(PID, SalePrice, factorCol) %>% mutate(one = 1) %>% spread(factorCol, one, fill = 0) %>% select(-PID) %>% cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE) %>% print()
}
}
## [1] "StateClassCode"
## SalePrice 101 1021 102
## 1.00000000 0.45386809 0.03706046 -0.38583711
## [1] "PropertyClass"
## SalePrice SNGL-FAM-RES CNDO LUX CONDOMINIUM
## 1.00000000 0.45386809 0.03706046 -0.38583711
## [1] "Zoning"
## SalePrice A-2 A-1 B C-1
## 1.000000000 0.493464548 0.228009911 0.177753670 0.097570037
## C-3 C-2 C SD-2 O-1
## 0.065494172 0.013429479 0.007343891 -0.006162950 -0.017796302
## BA <NA>
## -0.026337659 -0.397953184
## [1] "TaxDistrict"
## SalePrice R10 R6 R11 R14
## 1.000000000 0.438026152 0.167784252 0.166326699 0.118820706
## R9 R15 R16 R8 R4
## 0.080227125 0.050117107 0.020045536 0.009295112 0.004945910
## R5 R17 R3 R1A R13
## 0.003370830 -0.026501863 -0.035182866 -0.040098555 -0.041477920
## R12 R1 R2 R7
## -0.072763471 -0.081459706 -0.093449420 -0.150073844
## [1] "ResidentialExemption"
## SalePrice 1 0
## 1.000000000 0.006350188 -0.006350188
## [1] "Owner_State"
## SalePrice CO ME NY MI
## 1.000000000 0.352902598 0.127051362 0.021636078 0.008468584
## PA -- GA VA CT
## 0.001218041 -0.004500393 -0.006778033 -0.013543941 -0.016250305
## NH NJ TN TX -
## -0.019147223 -0.019756276 -0.020002309 -0.021557880 -0.021847556
## FL MD VT CA MA
## -0.023139230 -0.025599649 -0.034395282 -0.035554834 -0.054221470
## [1] "Exterior_Style"
## SalePrice COLONIAL VICTORIAN CONVENTIONAL
## 1.000000000 0.427072996 0.334634721 0.138025531
## CONTEMPORARY ROW-END SINGLE SEMIDETACHED
## 0.121263652 0.108869637 0.097298628 0.066579635
## TWO-FLOORS TOWNHOUSE OTHER CAPE-COD
## 0.043814846 0.042463923 0.041607030 0.036941118
## ROW RANCH FREE-STANDING GARRISON
## 0.032012397 0.031071628 0.021372631 -0.004932785
## TOWNHSE-END ATTIC Commercial Condo BASEMENT
## -0.015630126 -0.018471762 -0.023323755 -0.083686929
## FLAT APARTMENT
## -0.144854633 -0.268979306
## [1] "Exterior_occupancy"
## SalePrice SNGL-FAM-RES <NA>
## 1.0000000 0.4538681 -0.4538681
## [1] "Exterior_WallType"
## SalePrice Frame-Clapbrd Brick Wood Shingle Concrete Block
## 1.000000000 0.344484651 0.279351777 0.150615224 0.091143116
## Aluminum-Vinyl Asbstos Shingl Other Stucco <NA>
## 0.057735497 0.049015333 0.035246926 -0.008008198 -0.453868089
## [1] "Exterior_RoofType"
## SalePrice Hip Gambrel Gable Mansard Flat
## 1.00000000 0.40407433 0.24440946 0.22299574 0.16314565 0.08891257
## Shed <NA>
## 0.01110098 -0.45386809
## [1] "Exterior_RoofMaterial"
## SalePrice Slate Clay Aspahlt Shingl Other Rubber Membrai
## 1.00000000 0.45948037 0.24998849 0.09515368 0.06512724
## Roll Roofing <NA>
## 0.03107163 -0.45386809
## [1] "Interior_Flooring"
## SalePrice <NA> Cermc/Qry Tile Softwood Vinyl/Linoleum
## 1.00000000 0.43152723 -0.02154002 -0.02482636 -0.03187432
## Carpet Hardwood
## -0.07443793 -0.34601312
## [1] "Interior_Layout"
## SalePrice <NA> Superior Corner Unit Poor Thru Unit
## 1.00000000 0.42986596 0.04181349 0.03178377 -0.01940709 -0.14152009
## No Impact
## -0.19519846
## [1] "Interior_LaundryInUnit"
## SalePrice <NA> 1 0
## 1.0000000 0.4538681 -0.0696832 -0.2861683
## [1] "Systems_HeatType"
## SalePrice Other Forced Air Elec-Radiant Space Heat
## 1.0000000000 0.2230587252 0.1808004957 -0.0003196656 -0.0115141688
## H.V.A.C HW Radiator Elec Wall <NA> Hot Water
## -0.0173405528 -0.0186111036 -0.0216308218 -0.0685689647 -0.0698772827
## Heat Pump Steam
## -0.0928303465 -0.1067346127
## [1] "Systems_HeatFuel"
## SalePrice Gas Oil Electric <NA>
## 1.0000000 0.4171979 0.1473439 0.0506797 -0.4538681
## [1] "Systems_CentralAir"
## SalePrice 1 0 <NA>
## 1.0000000 0.4504958 0.1746491 -0.4538681
## [1] "Systems_Plumbing"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Condition_InteriorCondition"
## SalePrice Excellent Good Fair Average Very Good
## 1.00000000 0.43744545 0.14377645 0.11667631 0.07758647 0.07658401
## <NA>
## -0.45386809
## [1] "Condition_OverallCondition"
## SalePrice Superior Excellent Fair Poor Average
## 1.00000000 0.36362289 0.11760053 0.05898255 -0.01420357 -0.04973421
## Very Good Good
## -0.07650985 -0.10228084
## [1] "Condition_OverallGrade"
## SalePrice Estate Quality Good Very Good Very Good Ex Excellent
## 1.000000000 0.353967910 0.322367077 0.274903062 0.267762467
## Very Good Average Plus Poor Fair Good
## 0.182489517 -0.003498485 -0.013072140 -0.038241746 -0.132633012
## Average
## -0.245905467
There are quite a few interesting results here! One that jumps out immediately is that the conditions and grades don’t necessarily go in order. An interior condition of “Very Good” is less correlated with a high sale price than a condition of “Good”. It’s pretty obvious there are some correlations between that feature and other features which are skewing the correlations. It’s also unclear how many properties actually were given each ranking. In fact, let’s count up the instance of each level of the factor features.
# singleResSaleTrain %>% select(-PID) %>% select_if(is.factor) %>% sum()
for (factorCol in factorCols) {
if (factorCol != "PID") {
print(factorCol)
singleResSaleTrain %>% select(factorCol) %>% count()
# singleResSaleTrain %>% select(PID, SalePrice, factorCol) %>% mutate(one = 1) %>% spread(factorCol, one, fill = 0) %>% select(-PID) %>% cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE) %>% print()
}
}
## [1] "StateClassCode"
## [1] "PropertyClass"
## [1] "Zoning"
## [1] "TaxDistrict"
## [1] "ResidentialExemption"
## [1] "Owner_State"
## [1] "Exterior_Style"
## [1] "Exterior_occupancy"
## [1] "Exterior_WallType"
## [1] "Exterior_RoofType"
## [1] "Exterior_RoofMaterial"
## [1] "Interior_Flooring"
## [1] "Interior_Layout"
## [1] "Interior_LaundryInUnit"
## [1] "Systems_HeatType"
## [1] "Systems_HeatFuel"
## [1] "Systems_CentralAir"
## [1] "Systems_Plumbing"
## [1] "Condition_InteriorCondition"
## [1] "Condition_OverallCondition"
## [1] "Condition_OverallGrade"
singleResSaleTrain %>% group_by(Exterior_Style) %>% tally() %>% arrange(desc(n))
I think there may be some correlations between building materials and style. Let’s see how serious they are.
singleResSaleTrain %>% group_by(Exterior_Style, Exterior_WallType) %>% tally() %>% arrange(desc(n))
## Warning: Factor `Exterior_WallType` contains implicit NA, consider using
## `forcats::fct_explicit_na`
Actually, it appears the vast majority of sales are of apartments, flats, townhouses, etc which do not have exterior walls or they are not recorded.
singleResSaleTrain %>% group_by(Exterior_Style, Exterior_occupancy) %>% tally() %>% arrange(Exterior_Style, desc(n))
## Warning: Factor `Exterior_occupancy` contains implicit NA, consider using
## `forcats::fct_explicit_na`
singleResSaleTrain %>% group_by(Exterior_occupancy, Exterior_RoofType) %>% tally() %>% arrange(Exterior_RoofType)
## Warning: Factor `Exterior_occupancy` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `Exterior_RoofType` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `Exterior_occupancy` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `Exterior_occupancy` contains implicit NA, consider using
## `forcats::fct_explicit_na`
singleResSaleTrain %>% filter(as.character(Exterior_occupancy) != "SNGL-FAM-RES" | is.na(Exterior_occupancy))# %>% summarize_all(funs(sum(is.na(.)))) %>% t()
for (factorCol in factorCols) {
if (factorCol != "PID") {
print(factorCol)
singleResSaleTrain %>%
filter(as.character(Exterior_occupancy) == "SNGL-FAM-RES") %>%
select(PID, SalePrice, factorCol) %>% mutate(one = 1) %>%
spread(factorCol, one, fill = 0) %>% select(-PID) %>%
cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE) %>%
print()
}
}
## [1] "StateClassCode"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "PropertyClass"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Zoning"
## SalePrice A-2 A-1 C-3 <NA> C-2
## 1.00000000 0.55615298 0.27505796 0.04458499 0.03129629 -0.06069310
## SD-2 O-1 C B C-1
## -0.06716139 -0.07003898 -0.13273946 -0.18947486 -0.19222867
## [1] "TaxDistrict"
## SalePrice R10 R6 R14 R3 R16
## 1.00000000 0.53026907 0.32391108 0.17204876 0.04114921 0.03892573
## R9 R15 R4 R11 R8 R12
## 0.02725509 0.02060508 0.01573571 -0.04170017 -0.07490416 -0.10121535
## R17 R2 R13 R1 R7
## -0.10419031 -0.11164874 -0.14267194 -0.18777370 -0.33155425
## [1] "ResidentialExemption"
## SalePrice 0 1
## 1.0000000 0.1301559 -0.1301559
## [1] "Owner_State"
## SalePrice CO NY -- NH CA
## 1.00000000 0.49444808 0.01189068 -0.01872197 -0.07339616 -0.09637219
## MA
## -0.10966525
## [1] "Exterior_Style"
## SalePrice COLONIAL VICTORIAN CONTEMPORARY ROW-END
## 1.000000000 0.435012257 0.299161350 0.108033551 0.055042910
## RANCH OTHER CAPE-COD GARRISON ROW
## -0.009095571 -0.047497861 -0.058268591 -0.065242994 -0.066032977
## TOWNHSE-END TOWNHOUSE CONVENTIONAL
## -0.169153091 -0.201823100 -0.315036054
## [1] "Exterior_occupancy"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Exterior_WallType"
## SalePrice Brick Frame-Clapbrd Concrete Block Other
## 1.00000000 0.25236388 0.10972323 0.08458348 -0.02655564
## Asbstos Shingl Stucco Wood Shingle Aluminum-Vinyl
## -0.05324821 -0.07003898 -0.10862346 -0.18222837
## [1] "Exterior_RoofType"
## SalePrice Hip Gambrel Mansard Flat
## 1.000000000 0.426213845 0.268563513 0.064714966 -0.002745705
## Shed Gable
## -0.149648912 -0.346627889
## [1] "Exterior_RoofMaterial"
## SalePrice Slate Clay Other Roll Roofing Rubber Membrai
## 1.000000000 0.479088666 0.067179639 -0.009095571 -0.013866012
## Aspahlt Shingl
## -0.443698828
## [1] "Interior_Flooring"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Interior_Layout"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Interior_LaundryInUnit"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Systems_HeatType"
## SalePrice Other Forced Air Elec-Radiant Steam
## 1.00000000 0.24973498 0.11665633 -0.05804902 -0.07400324
## Hot Water
## -0.13587871
## [1] "Systems_HeatFuel"
## SalePrice Gas Electric Oil
## 1.00000000 0.01054795 -0.00240819 -0.01017293
## [1] "Systems_CentralAir"
## SalePrice 1 0
## 1.0000000 0.3140731 -0.3140731
## [1] "Systems_Plumbing"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Condition_InteriorCondition"
## SalePrice Excellent Fair Average Very Good Good
## 1.00000000 0.32969302 0.04129457 -0.11722781 -0.12704186 -0.18623783
## [1] "Condition_OverallCondition"
## SalePrice Superior Excellent Fair Average Very Good
## 1.00000000 0.45636459 0.14898957 0.02305601 -0.04194242 -0.10566094
## Good
## -0.22385606
## [1] "Condition_OverallGrade"
## SalePrice Estate Quality Excellent Very Good Very Good Ex
## 1.00000000 0.49444808 0.35056863 0.34738030 0.30382588
## Good Very Good Average Plus Poor Good Fair
## 0.21450636 -0.08718006 -0.10215983 -0.13468674 -0.19886541
## Average
## -0.32919019
for (factorCol in factorCols) {
if (factorCol != "PID") {
print(factorCol)
singleResSaleTrain %>%
filter(as.character(Exterior_occupancy) != "SNGL-FAM-RES" | is.na(Exterior_occupancy)) %>%
select(PID, SalePrice, factorCol) %>% mutate(one = 1) %>%
spread(factorCol, one, fill = 0) %>% select(-PID) %>%
cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE) %>%
print()
}
}
## [1] "StateClassCode"
## SalePrice 1021 102
## 1.0000000 0.1746209 -0.1746209
## [1] "PropertyClass"
## SalePrice CNDO LUX CONDOMINIUM
## 1.0000000 0.1746209 -0.1746209
## [1] "Zoning"
## SalePrice <NA> B O-1 C BA
## 1.00000000 0.04263609 0.02390421 -0.01219114 -0.01747849 -0.03286140
## C-1
## -0.05726753
## [1] "TaxDistrict"
## SalePrice R11 R10 R16 R9
## 1.000000000 0.307633640 0.270088570 0.043850510 0.043542008
## R8 R6 R1A R5 R17
## 0.030623345 0.026802738 0.024870189 0.015332130 0.011480866
## R13 R14 R4 R3 R1
## 0.006124646 0.002560371 -0.005545813 -0.030978824 -0.039599006
## R12 R2 R7
## -0.047894406 -0.090967556 -0.154776056
## [1] "ResidentialExemption"
## SalePrice 1 0
## 1.00000000 0.03926992 -0.03926992
## [1] "Owner_State"
## SalePrice CO ME MI PA
## 1.000000000 0.244825125 0.220081600 0.027697645 0.011839845
## NY GA VA NH CT
## 0.002116040 -0.001131498 -0.012107250 -0.012402937 -0.016497551
## TX NJ TN -- MD
## -0.021016090 -0.022184986 -0.022584104 -0.023557677 -0.024429054
## - FL CA MA VT
## -0.025577491 -0.027672862 -0.028798843 -0.037158726 -0.041842977
## [1] "Exterior_Style"
## SalePrice TWO-FLOORS SINGLE TOWNHOUSE
## 1.00000000 0.18629841 0.17497417 0.14822622
## SEMIDETACHED OTHER FREE-STANDING TOWNHSE-END
## 0.14396342 0.09285234 0.05892525 0.02296720
## ATTIC Commercial Condo FLAT BASEMENT
## -0.01286401 -0.02797220 -0.09437659 -0.10148348
## APARTMENT
## -0.18673767
## [1] "Exterior_occupancy"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Exterior_WallType"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Exterior_RoofType"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Exterior_RoofMaterial"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Interior_Flooring"
## SalePrice Hardwood Softwood Cermc/Qry Tile Vinyl/Linoleum
## 1.00000000 0.09559675 -0.01605328 -0.02507859 -0.03460975
## <NA> Carpet
## -0.05736854 -0.06698034
## [1] "Interior_Layout"
## SalePrice Corner Unit Superior Thru Unit Poor
## 1.000000000 0.150749197 0.077694357 -0.001147911 -0.009381232
## <NA> No Impact
## -0.052504954 -0.088404924
## [1] "Interior_LaundryInUnit"
## SalePrice 1 0
## 1.0000000 0.2994261 -0.2994261
## [1] "Systems_HeatType"
## SalePrice Forced Air Elec Wall HW Radiator H.V.A.C
## 1.000000000 0.284747170 0.004698664 0.004189207 -0.003904207
## Space Heat Heat Pump <NA> Steam Hot Water
## -0.008814524 -0.074137119 -0.081571156 -0.144316175 -0.150656296
## [1] "Systems_HeatFuel"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Systems_CentralAir"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Systems_Plumbing"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Condition_InteriorCondition"
## Warning in cor(.): the standard deviation is zero
## SalePrice
## 1
## [1] "Condition_OverallCondition"
## SalePrice Excellent Poor Fair Very Good
## 1.000000000 0.215873693 -0.009084755 -0.030600922 -0.036106529
## Average Good
## -0.128264658 -0.144040027
## [1] "Condition_OverallGrade"
## SalePrice Excellent Very Good Good Fair Average
## 1.00000000 0.43626835 0.23764798 -0.06057705 -0.08371644 -0.25037003
singleResSaleTrain %>%
filter(as.character(Exterior_occupancy) == "SNGL-FAM-RES") %>%
select_if(is.numeric) %>% cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE)
## SalePrice AssessedValue PreviousAssessedValue
## 1.0000000 0.9737212 0.9400675
## BuildingValue LandValue Interior_LivingArea
## 0.9126605 0.8771784 0.8448729
## Interior_Fireplaces LandArea Interior_TotalRooms
## 0.7305289 0.7202855 0.6453451
## Interior_FullBaths Interior_Bedrooms UnfinishedBasementGross
## 0.6286787 0.4852795 0.4505377
## Parking_Open Parking_Covered Interior_HalfBaths
## 0.4192400 0.4179977 0.4140147
## Exterior_NumStories FinishedBasementGross Condition_YearBuilt
## 0.3920473 0.2082721 -0.1245507
singleResSaleTrain %>%
filter(as.character(Exterior_occupancy) != "SNGL-FAM-RES" | is.na(Exterior_occupancy)) %>%
select_if(is.numeric) %>% select(-c(LandArea, LandValue)) %>% cor() %>% .[,"SalePrice"] %>%
sort(decreasing = TRUE)
## SalePrice BuildingValue AssessedValue
## 1.00000000 0.91508242 0.91508242
## Interior_LivingArea PreviousAssessedValue Interior_FullBaths
## 0.70852811 0.64067864 0.53914213
## Interior_TotalRooms Interior_Bedrooms Interior_HalfBaths
## 0.47969877 0.47008078 0.35268607
## FinishedBasementGross Interior_Fireplaces Exterior_NumStories
## 0.32622401 0.29460666 0.29417724
## UnfinishedBasementGross Parking_Open Condition_YearBuilt
## 0.20572019 0.17026747 -0.01757395
## Parking_Covered
## -0.03608285
singleResSaleTrain %>%
filter(as.character(Exterior_occupancy) == "SNGL-FAM-RES") %>%
select(SalePrice, Interior_Fireplaces) %>%
mutate(Has_Fireplace = Interior_Fireplaces > 0,
TwoPlus_FP = Interior_Fireplaces > 1,
Multi_FP = Interior_Fireplaces > 2) %>%
cor() %>% .[,"SalePrice"] %>%
sort(decreasing = TRUE)
## SalePrice Interior_Fireplaces Multi_FP
## 1.0000000 0.7305289 0.6076482
## TwoPlus_FP Has_Fireplace
## 0.5918606 0.4716837
I think it makes sense to try developing two seperate models for sale price before attempting to combine the two.
asfTrain <- singleResSaleTrain %>%
filter(as.character(Exterior_occupancy) != "SNGL-FAM-RES" | is.na(Exterior_occupancy))
head(asfTrain)
data {
int<lower=0> n;
real area[n];
real sale_price[n];
}
parameters {
real a;
real b_area;
real<lower=0> sig;
}
model {
a ~ normal(500000, 10000);
b_area ~ normal(0, 10);
sig ~ cauchy(0, 10);
for (i in 1:n) {
sale_price ~ normal(a + b_area * area[i], sig);
}
}
subset_train <- asfTrain[1:500,]
prop_dat <- list(
n = nrow(subset_train),
area = as.numeric(subset_train$Interior_LivingArea),
rooms = subset_train$Interior_TotalRooms,
bedrooms = subset_train$Interior_Bedrooms,
halfbaths = subset_train$Interior_HalfBaths,
fullbaths = subset_train$Interior_FullBaths,
sale_price = subset_train$SalePrice
)
att_area_fit <- sampling(att_area, data = prop_dat)
print(att_area_fit)
## Inference for Stan model: 4578c88cf133b09aa230a27c880f48da.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50%
## a 638757.78 37.53 1758.35 635309.79 637607.2 638744.6
## b_area 3.43 0.03 1.45 0.62 2.5 3.4
## sig 350556.45 15.70 488.46 349635.09 350222.1 350551.6
## lp__ -3316941.80 0.03 1.18 -3316944.81 -3316942.4 -3316941.5
## 75% 97.5% n_eff Rhat
## a 639909.06 642229.19 2195 1
## b_area 4.39 6.32 2280 1
## sig 350868.90 351558.78 968 1
## lp__ -3316940.91 -3316940.43 1447 1
##
## Samples were drawn using NUTS(diag_e) at Sat Mar 30 22:51:33 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
plot(att_area_fit)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
data {
int<lower=0> n;
real rooms[n];
real bedrooms[n];
real sale_price[n];
}
transformed data {
real otherrooms[n];
for (i in 1:n) {
otherrooms[i] = rooms[i] - bedrooms[i];
}
}
parameters {
real a;
real b_bedrooms;
real b_otherrooms;
real<lower=0> sig;
}
model {
a ~ normal(500000, 10000);
b_bedrooms ~ normal(0, 10000);
b_otherrooms ~ normal(0, 10000);
sig ~ cauchy(0, 10);
for (i in 1:n) {
sale_price ~ normal(a + b_bedrooms * bedrooms[i] + b_otherrooms * otherrooms[i], sig);
}
}
subset_train <- asfTrain[1:500,]
prop_dat <- list(
n = nrow(subset_train),
area = as.numeric(subset_train$Interior_LivingArea),
rooms = subset_train$Interior_TotalRooms,
bedrooms = subset_train$Interior_Bedrooms,
halfbaths = subset_train$Interior_HalfBaths,
fullbaths = subset_train$Interior_FullBaths,
sale_price = subset_train$SalePrice
)
att_rooms_fit <- sampling(att_rooms, data = prop_dat)
print(att_rooms_fit)
## Inference for Stan model: 6d17a87868e93b20c80e2663ebcc758f.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25%
## a 635860.69 49.02 2279.42 631325.3 634382.25
## b_bedrooms 1456.38 19.18 907.50 -335.8 827.23
## b_otherrooms 1284.91 15.45 709.80 -68.3 803.49
## sig 350580.18 8.97 495.90 349605.8 350243.48
## lp__ -3316940.23 0.04 1.39 -3316943.7 -3316940.95
## 50% 75% 97.5% n_eff Rhat
## a 635856.15 637390.27 640280.45 2162 1
## b_bedrooms 1469.18 2088.33 3197.36 2239 1
## b_otherrooms 1274.23 1768.41 2749.14 2112 1
## sig 350571.55 350915.93 351546.41 3054 1
## lp__ -3316939.95 -3316939.20 -3316938.46 1439 1
##
## Samples were drawn using NUTS(diag_e) at Sat Mar 30 22:56:52 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
data {
int<lower=0> n;
real rooms[n];
real bedrooms[n];
real area[n];
real sale_price[n];
}
transformed data {
real otherrooms[n];
real roomsize[n];
for (i in 1:n) {
otherrooms[i] = rooms[i] - bedrooms[i];
roomsize[i] = area[i] / rooms[i];
}
}
parameters {
real a;
real b_bedrooms;
real b_otherrooms;
real b_roomsize;
real<lower=0> sig;
}
model {
a ~ normal(500000, 10000);
b_bedrooms ~ normal(0, 10000);
b_otherrooms ~ normal(0, 10000);
b_roomsize ~ normal(0, 1000);
sig ~ cauchy(0, 10);
for (i in 1:n) {
sale_price ~ normal(a + b_bedrooms * bedrooms[i] + b_otherrooms * otherrooms[i] + b_roomsize * roomsize[i], sig);
}
}
print(att_rooms_area_fit)
## Inference for Stan model: a993496f44f32b13b1135e5c7fcb7e7b.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25%
## a 631853.17 39.04 2259.77 627530.40 630288.74
## b_bedrooms 642.47 18.02 672.15 -637.67 172.13
## b_otherrooms 1019.88 10.75 554.51 -80.23 640.88
## b_roomsize 11.67 0.08 5.57 0.25 8.03
## sig 352941.37 8.30 352.49 352255.87 352697.22
## lp__ -6504396.44 0.04 1.48 -6504400.07 -6504397.23
## 50% 75% 97.5% n_eff Rhat
## a 631793.51 633379.38 636307.84 3350 1
## b_bedrooms 655.67 1104.75 1930.03 1392 1
## b_otherrooms 1030.34 1398.22 2075.59 2663 1
## b_roomsize 11.66 15.44 22.39 4375 1
## sig 352935.75 353186.92 353635.20 1802 1
## lp__ -6504396.15 -6504395.36 -6504394.46 1677 1
##
## Samples were drawn using NUTS(diag_e) at Sat Mar 30 23:11:41 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
data {
int<lower=0> n;
real rooms[n];
real bedrooms[n];
real area[n];
real sale_price[n];
real wd[n];
}
transformed data {
real otherrooms[n];
real roomsize[n];
for (i in 1:n) {
otherrooms[i] = rooms[i] - bedrooms[i];
roomsize[i] = area[i] / rooms[i];
}
}
parameters {
real a;
real b_bedrooms;
real b_otherrooms;
real b_roomsize;
real b_wd;
real<lower=0> sig;
}
model {
a ~ normal(500000, 10000);
b_bedrooms ~ normal(0, 10000);
b_otherrooms ~ normal(0, 10000);
b_roomsize ~ normal(0, 1000);
b_wd ~ normal(0, 1000);
sig ~ cauchy(0, 10);
for (i in 1:n) {
sale_price ~ normal(a + b_bedrooms * bedrooms[i] + b_otherrooms * otherrooms[i] + b_roomsize * roomsize[i] + b_wd * wd[i], sig);
}
}
print(att_rooms_area_wd_fit)
## Inference for Stan model: e97e93a1d75b860a2d2a45a831123245.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50%
## a 575382.73 108.21 5309.96 564942.09 571729.45 575349.51
## b_bedrooms 2210.26 31.86 1602.44 -851.09 1106.20 2222.23
## b_otherrooms 4437.91 28.39 1398.58 1632.50 3515.63 4446.85
## b_roomsize 51.06 0.26 14.31 23.42 41.20 51.13
## b_wd 52.50 21.68 957.17 -1920.17 -566.51 62.37
## sig 258249.34 13.79 921.87 256419.18 257626.96 258263.29
## lp__ -518506.57 0.05 1.70 -518510.73 -518507.49 -518506.27
## 75% 97.5% n_eff Rhat
## a 578957.48 585688.00 2408 1
## b_bedrooms 3260.42 5365.73 2530 1
## b_otherrooms 5353.74 7180.09 2426 1
## b_roomsize 60.80 80.01 2954 1
## b_wd 704.35 1890.35 1950 1
## sig 258870.93 260020.92 4468 1
## lp__ -518505.32 -518504.19 1287 1
##
## Samples were drawn using NUTS(diag_e) at Sat Mar 30 23:14:16 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
# what are the physical locations of properties?